23 research outputs found
Format Abstraction for Sparse Tensor Algebra Compilers
This paper shows how to build a sparse tensor algebra compiler that is
agnostic to tensor formats (data layouts). We develop an interface that
describes formats in terms of their capabilities and properties, and show how
to build a modular code generator where new formats can be added as plugins. We
then describe six implementations of the interface that compose to form the
dense, CSR/CSF, COO, DIA, ELL, and HASH tensor formats and countless variants
thereof. With these implementations at hand, our code generator can generate
code to compute any tensor algebra expression on any combination of the
aforementioned formats.
To demonstrate our technique, we have implemented it in the taco tensor
algebra compiler. Our modular code generator design makes it simple to add
support for new tensor formats, and the performance of the generated code is
competitive with hand-optimized implementations. Furthermore, by extending taco
to support a wider range of formats specialized for different application and
data characteristics, we can improve end-user application performance. For
example, if input data is provided in the COO format, our technique allows
computing a single matrix-vector multiplication directly with the data in COO,
which is up to 3.6 faster than by first converting the data to CSR.Comment: Presented at OOPSLA 201
Sparse Tensor Transpositions
We present a new algorithm for transposing sparse tensors called Quesadilla.
The algorithm converts the sparse tensor data structure to a list of
coordinates and sorts it with a fast multi-pass radix algorithm that exploits
knowledge of the requested transposition and the tensors input partial
coordinate ordering to provably minimize the number of parallel partial sorting
passes. We evaluate both a serial and a parallel implementation of Quesadilla
on a set of 19 tensors from the FROSTT collection, a set of tensors taken from
scientific and data analytic applications. We compare Quesadilla and a
generalization, Top-2-sadilla to several state of the art approaches, including
the tensor transposition routine used in the SPLATT tensor factorization
library. In serial tests, Quesadilla was the best strategy for 60% of all
tensor and transposition combinations and improved over SPLATT by at least 19%
in half of the combinations. In parallel tests, at least one of Quesadilla or
Top-2-sadilla was the best strategy for 52% of all tensor and transposition
combinations.Comment: This work will be the subject of a brief announcement at the 32nd ACM
Symposium on Parallelism in Algorithms and Architectures (SPAA '20
Compiling Recurrences over Dense and Sparse Arrays
Recurrence equations lie at the heart of many computational paradigms
including dynamic programming, graph analysis, and linear solvers. These
equations are often expensive to compute and much work has gone into optimizing
them for different situations. The set of recurrence implementations is a large
design space across the set of all recurrences (e.g., the Viterbi and
Floyd-Warshall algorithms), the choice of data structures (e.g., dense and
sparse matrices), and the set of different loop orders. Optimized library
implementations do not exist for most points in this design space, and
developers must therefore often manually implement and optimize recurrences. We
present a general framework for compiling recurrence equations into native code
corresponding to any valid point in this general design space. In this
framework, users specify a system of recurrences, the type of data structures
for storing the input and outputs, and a set of scheduling primitives for
optimization. A greedy algorithm then takes this specification and lowers it
into a native program that respects the dependencies inherent to the recurrence
equation. We describe the compiler transformations necessary to lower this
high-level specification into native parallel code for either sparse and dense
data structures and provide an algorithm for determining whether the recurrence
system is solvable with the provided scheduling primitives. We evaluate the
performance and correctness of the generated code on various computational
tasks from domains including dense and sparse matrix solvers, dynamic
programming, graph problems, and sparse tensor algebra. We demonstrate that
generated code has competitive performance to handwritten implementations in
libraries
The Tensor Algebra Compiler
Tensor and linear algebra is pervasive in data analytics and the physical sciences. Often the tensors, matrices or even vectors are sparse. Computing expressions involving a mix of sparse and dense tensors, matrices and vectors requires writing kernels for every operation and combination of formats of interest. The number of possibilities is infinite, which makes it impossible to write library code for all. This problem cries out for a compiler approach. This paper presents a new technique that compiles compound tensor algebra expressions combined with descriptions of tensor formats into efficient loops. The technique is evaluated in a prototype compiler called taco, demonstrating competitive performance to best-in-class hand-written codes for tensor and matrix operations
Stardust: Compiling Sparse Tensor Algebra to a Reconfigurable Dataflow Architecture
We introduce Stardust, a compiler that compiles sparse tensor algebra to
reconfigurable dataflow architectures (RDAs). Stardust introduces new
user-provided data representation and scheduling language constructs for
mapping to resource-constrained accelerated architectures. Stardust uses the
information provided by these constructs to determine on-chip memory placement
and to lower to the Capstan RDA through a parallel-patterns rewrite system that
targets the Spatial programming model. The Stardust compiler is implemented as
a new compilation path inside the TACO open-source system. Using cycle-accurate
simulation, we demonstrate that Stardust can generate more Capstan tensor
operations than its authors had implemented and that it results in 138
better performance than generated CPU kernels and 41 better performance
than generated GPU kernels.Comment: 15 pages, 13 figures, 6 tables
Compiler Support for Sparse Tensor Computations in MLIR
Sparse tensors arise in problems in science, engineering, machine learning,
and data analytics. Programs that operate on such tensors can exploit sparsity
to reduce storage requirements and computational time. Developing and
maintaining sparse software by hand, however, is a complex and error-prone
task. Therefore, we propose treating sparsity as a property of tensors, not a
tedious implementation task, and letting a sparse compiler generate sparse code
automatically from a sparsity-agnostic definition of the computation. This
paper discusses integrating this idea into MLIR
The Sparse Abstract Machine
We propose the Sparse Abstract Machine (SAM), an abstract machine model for
targeting sparse tensor algebra to reconfigurable and fixed-function spatial
dataflow accelerators. SAM defines a streaming dataflow abstraction with sparse
primitives that encompass a large space of scheduled tensor algebra
expressions. SAM dataflow graphs naturally separate tensor formats from
algorithms and are expressive enough to incorporate arbitrary iteration
orderings and many hardware-specific optimizations. We also present Custard, a
compiler from a high-level language to SAM that demonstrates SAM's usefulness
as an intermediate representation. We automatically bind from SAM to a
streaming dataflow simulator. We evaluate the generality and extensibility of
SAM, explore the performance space of sparse tensor algebra optimizations using
SAM, and show SAM's ability to represent dataflow hardware.Comment: 18 pages, 17 figures, 3 table
BaCO: A Fast and Portable Bayesian Compiler Optimization Framework
We introduce the Bayesian Compiler Optimization framework (BaCO), a general
purpose autotuner for modern compilers targeting CPUs, GPUs, and FPGAs. BaCO
provides the flexibility needed to handle the requirements of modern autotuning
tasks. Particularly, it deals with permutation, ordered, and continuous
parameter types along with both known and unknown parameter constraints. To
reason about these parameter types and efficiently deliver high-quality code,
BaCO uses Bayesian optimiza tion algorithms specialized towards the autotuning
domain. We demonstrate BaCO's effectiveness on three modern compiler systems:
TACO, RISE & ELEVATE, and HPVM2FPGA for CPUs, GPUs, and FPGAs respectively. For
these domains, BaCO outperforms current state-of-the-art autotuners by
delivering on average 1.36x-1.56x faster code with a tiny search budget, and
BaCO is able to reach expert-level performance 2.9x-3.9x faster